Abstract: We study the fundamental problem of power allocation
over multiple Gilbert-Elliott communication channels. In a
communication system with time-varying channel qualities, it is
important to allocate the limited transmission power to channels
that will be in a good state. However, this is very challenging
because channel states are usually unknown when the
power allocation decision is made. In this paper, we derive an
optimal power allocation policy that maximizes the expected
discounted number of bits transmitted over an infinite time span
by allocating the transmission power only to those channels that
are believed to be good in the coming time slot. We use the
concept of belief to represent the probability that a channel will be
good and derive an optimal power allocation policy that establishes
a mapping from the channel belief to an allocation decision.
Specifically, we first model this problem as a partially observable
Markov decision process (POMDP) and analytically investigate
the structure of the optimal policy. Then a simple threshold-based
policy is derived for a three-channel communication system. By
formulating and solving a linear programming formulation of
this power allocation problem, we further verify the derived
structure of the optimal policy.

I. Introduction 

Communication over the wireless medium is subject to 
multiple impairments such as fading, path loss, and inter- 
ference. These effects degrade the quality of received signal 
and lead to transmission failures. The quality of the radio 
channel is often random and evolves in time, ranging from 
good to bad depending on the propagation conditions. To 
cope with the changing channel quality and achieve a better 
channel utilization, it is important to adopt link adaptation 
schemes whereby data/coding rate and transmit power of the 
transmitted signal are adaptively adjusted according to the 
channel conditions [1], [2], [3], [4].

Adaptive power control is an important technique to se- 
lect the transmission power of a wireless system according 
to channel condition to achieve better network performance 
in terms of higher data rate or spectrum efficiency [1],[2]. 
There has been some recent work on power allocation over 
stochastic channels [5], [6], [7], but the problem of optimal
power allocation across multiple dynamic stochastic channels 
is challenging and remains largely unsolved from a theoretical 
perspective. 

We consider a wireless communication system operating on
$N$ ($N \ge 3$) parallel transmission channels. Each channel is
modeled as a time-slotted two-state Markov model known
as the Gilbert-Elliott channel. This model assumes that the
channel can be in either a good state or a bad state. A
channel in the good state can transmit successfully at a certain
rate, whereas a channel in the bad state causes transmission
failure and therefore data loss. We assume all channels
in the system are statistically identical and independent of
each other. Our goal is to allocate the total transmission
power only to channels in the good state so as to maximize the
expected discounted number of bits transmitted over an infinite
time span. Since the channel states are unknown at the time
the power allocation decision is made, this problem is more
challenging than it appears.

There is already some related work on the
decision-making problem over Gilbert-Elliott channels in the
literature. In [3] and [4], the authors used Markov Decision
Process (MDP) tools to establish optimal threshold
strategies that minimize the transmission consumption and
maximize the throughput over a single Gilbert-Elliott channel. In
[8], the authors defined three transmitting actions and solved
the problem of dynamically choosing one of them to maximize
the expected discounted number of bits transmitted. In [9], the
authors study the problem of choosing a transmitting strategy
from two choices, emphasizing the case when the channel
transition probabilities are unknown. The work in [10] and
[11] is most relevant to this paper. The differences
among the three are as follows: [10] addresses the power
allocation problem in the context of two identical channels
and three allocation strategies (betting on channel 1, betting on
channel 2, and using both channels), whilst [11] adds one more
action, using none of the channels, and introduces a penalty
caused by transmission on a bad channel. The spirit of this
paper is similar to [10] and [11], but it addresses a more
challenging setting involving $N$ identical channels ($N \ge 3$).
When $N$ is large, the power allocation decisions become
much more complicated, and it is more difficult to derive and
express the optimal policy.

In this paper, we formulate our power allocation problem
as a partially observable Markov decision process (POMDP).
We then treat the POMDP as a continuous-state MDP and
develop the structure of the optimal policy (decision). Our
main contributions are summarized as follows: (1) we formulate
the problem of dynamic power allocation over multiple
parallel Gilbert-Elliott channels using MDP theory; (2)
we theoretically prove some key properties of the optimal
policy for this particular problem and derive the exact optimal
policy for the three-channel system; (3) through simulation
based on linear programming, we verify the structure of the
optimal policy and demonstrate how to numerically compute
the thresholds and construct the optimal policy when the system
parameters are known.

II. Problem Formulation 
A. Channel model and assumptions 

In this paper, we consider a wireless communication system
operating on $N$ parallel channels. We assume that these
channels are statistically identical and independent of each
other. Each channel is modeled by a time-slotted Gilbert-Elliott
channel, which is a one-dimensional two-state Markov
chain $G_{i,t}$ ($i \in \{1, 2, \dots, N\}$, $t \in \{1, 2, \dots, \infty\}$),
where $i$ is the channel index and $t$ is the time slot.
$G_{i,t} = 1$ means the channel is in the good state in time slot $t$,
and $G_{i,t} = 0$ means the channel is in the bad state in time slot $t$.
The state transition probabilities are denoted by
$\Pr[G_{i,t} = 1 \mid G_{i,t-1} = 1] = \lambda_1$ and
$\Pr[G_{i,t} = 1 \mid G_{i,t-1} = 0] = \lambda_0$, $i \in \{1, 2, \dots, N\}$.
We assume that the state transitions happen at the beginning of each
time slot and that the channels are positively correlated, i.e.,
$\lambda_0 < \lambda_1$, which means the probability of remaining in
the good state is higher than that of recovering from the bad state.

The total transmitting power of the communication system
is $P$. At the beginning of each time slot, the system needs to
allocate the limited power to the channels optimally. Let $P_i(t)$
denote the power allocated to channel $i$ at time $t$; we have:

$$P = \sum_{i=1}^{N} P_i(t). \tag{1}$$

We assume that the states of the channels are unknown at the
beginning of each time slot. If channel $i$ is used in time slot $t$
($P_i(t) > 0$), the state of channel $i$ in slot $t$ is revealed at the
end of that slot through a feedback mechanism. Otherwise,
if channel $i$ is not used ($P_i(t) = 0$), its exact state during
time slot $t$ remains unknown. This power allocation
problem is therefore challenging, because decisions have to be made
when the current channel states are unknown.

To simplify the problem, we adopt the following power
allocation strategy. At the beginning of each time slot, the
system chooses $k$ (hopefully good) channels out of the $N$
channels and allocates the total power $P$ to these $k$ channels
equally, so each selected channel is allocated $P/k$ of the
transmission power. If a channel is allocated power $P/k$,
there are two possible outcomes: 1) the channel is in the
good state and sends $R_k$ ($k \le N$) bits of data successfully
(reward); 2) the channel is in the bad state and suffers $C_k$ ($k \le N$)
bits of data loss due to poor channel quality (penalty). We
assume that $R_{k_2} < R_{k_1} < \frac{k_2}{k_1} R_{k_2}$ and
$C_{k_2} < C_{k_1} < \frac{k_2}{k_1} C_{k_2}$ ($1 \le k_1 < k_2 \le N$).
For all $1 \le k \le N$, we have $R_k > C_k$.
If a channel is not allocated any transmission power, it has
zero reward and zero penalty.

We define an $N$-dimensional vector $\alpha_i = (a_{i,1}, a_{i,2}, \dots, a_{i,N})$
to denote allocation action $i$, where $1 \le i \le 2^N$ and $a_{i,j} \in \{0, 1\}$;
$a_{i,j} = 1$ means channel $j$ is used in action $i$ and $a_{i,j} = 0$
means channel $j$ is not used in this action. Because the
total number of channels is $N$ and each channel can be either
used or not, there are $2^N$ possible allocation actions. We use
$B = \{\alpha_i, i \in \{1, 2, \dots, 2^N\}\}$ to denote the set of all $2^N$
different allocation actions. Define
$\|\alpha_i\| = a_{i,1}^2 + a_{i,2}^2 + \dots + a_{i,N}^2 = a_{i,1} + a_{i,2} + \dots + a_{i,N} = k$
as the number of used channels in this action. When $k$ is large,
the system spreads the risk of data loss over more channels and is
more likely to get a mediocre reward. When $k$ is small, the system
bets on fewer channels, which might lead to a better reward. The focus
of this paper is to find an optimal allocation policy that maximizes
the long-term discounted reward.
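
As a minimal sketch of the action set described above, the following Python snippet enumerates the $2^N$ binary allocation vectors and groups them by the number of used channels $k$. The function names are illustrative and are not taken from the paper.

```python
from itertools import product


def enumerate_actions(n_channels):
    """Return all 2^N binary allocation vectors, one tuple per action."""
    return list(product((0, 1), repeat=n_channels))


def channels_used(action):
    """k = ||alpha_i||: number of channels that each receive power P/k."""
    return sum(action)


# Three-channel example: 8 actions, grouped by how many channels they use.
actions = enumerate_actions(3)
by_k = {}
for a in actions:
    by_k.setdefault(channels_used(a), []).append(a)

for k in sorted(by_k):
    print(k, by_k[k])
```

For three channels this lists one action with $k = 0$, three with $k = 1$, three with $k = 2$, and one with $k = 3$.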

B. Formulation of the partially observable Markov decision problem

As described above, at the beginning of each time slot, the
system needs to choose an appropriate strategy in order to
maximize the data transmitted in the long term. Because the
exact channel state is not observable when this
decision is made, this problem can be described as a Partially
Observable Markov Decision Process (POMDP). In [12], it
is shown that, given the past history, a sufficient statistic for
determining the optimal policy is the conditional probability
that the channel is in the good state at the beginning of the
current time slot, which is called the belief. We denote the
belief by an $N$-dimensional vector $x_t = (x_{1,t}, x_{2,t}, \dots, x_{N,t})$,
where $x_{i,t} = \Pr[G_{i,t} = 1 \mid H_t]$, $i \in \{1, 2, \dots, N\}$, and
$H_t$ is all the history before time slot $t$. By introducing the belief,
we can convert the POMDP into a Markov Decision Process (MDP)
with an uncountable state space $\Omega = ([0, 1], [0, 1], \dots, [0, 1])$.



Define a policy $\pi$ as a decision-making rule that maps
the state space $\Omega$ to the action space $B$. Define
$V^\pi(p)$ as the expected discounted amount of data transmitted
with initial belief $p = (p_1, p_2, \dots, p_N)$, where
$p_i = \Pr[G_{i,0} = 1 \mid H_0] = x_{i,0}$, $i \in \{1, 2, \dots, N\}$. We have:

$$V^\pi(p) = E^\pi\Big[\sum_{t=0}^{\infty} \beta^t g_{a_t}(x_t) \,\Big|\, x_0 = p\Big] \tag{2}$$

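where $E^\pi$ is the expectation given policy $\pi$, $\beta$ is the discount
factor, $t$ is the time slot, $a_t \in B$ denotes the action taken in time
slot $t$, and $g_{a_t}(x_t)$ denotes the expected immediate reward
when choosing action $a_t$ given the belief $x_t$. Let $\alpha_{i_t}$ denote
the action $a_t$; then $\|\alpha_{i_t}\|$ is the number of channels used in
this action, and we have:

$$g_{a_t}(x_t) = \sum_{j=1}^{N} a_{i_t,j}\big[x_{j,t} R_{\|\alpha_{i_t}\|} - (1 - x_{j,t}) C_{\|\alpha_{i_t}\|}\big] \tag{3}$$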

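Let $m_t = \{m_1, m_2, \dots, m_{\|\alpha_{i_t}\|}\}$, $m_k \in \{1, 2, \dots, N\}$, be
the set of channels chosen by action $\alpha_{i_t}$, with $m_i \ne m_j$ ($i \ne j$).
Equation (3) can then be rewritten as:

$$g_{a_t}(x_t) = \sum_{k=1}^{\|\alpha_{i_t}\|} x_{m_k,t}\big(R_{\|\alpha_{i_t}\|} + C_{\|\alpha_{i_t}\|}\big) - \|\alpha_{i_t}\| C_{\|\alpha_{i_t}\|} \tag{4}$$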

Now we define the value function $V(p)$ as:

$$V(p) = \max_{\pi} V^\pi(p), \quad \forall p \in \Omega \tag{5}$$



A policy is called stationary if it is a function mapping the
state space $\Omega$ to the action space $B$. It is proved that there exists
a stationary policy $\pi^*$ that satisfies $V(p) = V^{\pi^*}(p)$ and also
the Bellman equation [13]:

$$V(p) = \max_{a \in B} \{V_a(p)\} \tag{6}$$



where $V_a(p)$ denotes the value acquired when the belief is $p$
and the immediate action is $a$:

$$V_a(p) = g_a(p) + \beta E^{y}\big[V(y) \mid x_0 = p, a_0 = a\big] \tag{7}$$

where $y$ denotes the belief at the beginning of the next time slot
when action $a$ is taken, and $V(y)$ denotes the expected total
reward when the belief in the next time slot is $y$.

Next we discuss the expression of $V_a(p)$. For each action
$a = \alpha_i \in B$, there are two types of channels: used and unused.
A used channel $j$ is allocated $P/\|\alpha_i\|$ transmission
power, so it has expected immediate reward $p_j R_{\|\alpha_i\|}$ and
expected immediate loss $(1 - p_j) C_{\|\alpha_i\|}$. Since the channel state in the
current time slot is revealed at the end of this time slot through
feedback, the belief of channel $j$ in the next time slot will be
either $\lambda_1$ (if channel $j$ is in the good state in the current time slot)
or $\lambda_0$ (if channel $j$ is in the bad state in the current slot).

For any unused channel $j$, there is no immediate
reward or loss, and there is no feedback to reveal the channel
state. Therefore, the belief in the next time slot is calculated
as:

$$T(p_j) = (1 - p_j)\lambda_0 + p_j \lambda_1 = \alpha p_j + \lambda_0 \tag{8}$$

where $\alpha = \lambda_1 - \lambda_0$.
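
The belief transition just described (feedback for a used channel, equation (8) for an unused one) can be sketched as follows. This is an illustration only: the function names and the boolean arguments are assumptions introduced here, not notation from the paper.

```python
def belief_update(p, lam0, lam1):
    """Equation (8): next-slot belief of an unused channel (no feedback)."""
    return (1.0 - p) * lam0 + p * lam1


def next_belief(p, used, observed_good, lam0, lam1):
    """Next-slot belief of one channel.

    A used channel reveals its state through feedback, so its next belief is
    lam1 (it was good) or lam0 (it was bad). An unused channel only evolves
    through the Markov transition T(p). `observed_good` is ignored when the
    channel is unused.
    """
    if used:
        return lam1 if observed_good else lam0
    return belief_update(p, lam0, lam1)


# Example with lam0 = 0.1 and lam1 = 0.9, i.e. alpha = 0.8.
lam0, lam1 = 0.1, 0.9
print(next_belief(0.3, used=False, observed_good=False, lam0=lam0, lam1=lam1))
print(next_belief(0.3, used=True, observed_good=True, lam0=lam0, lam1=lam1))
```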

For ease of notation, we omit the subscript $i$ of $\alpha_i$ and use
$a$ to denote a certain action taken in a certain time slot in the
following discussions. Let $m = \{m_1, m_2, \dots, m_{\|a\|}\}$,
$m_k \in \{1, 2, \dots, N\}$, be the set of channels used in action $a$. Let
$\varphi_i = (\varphi_{i,1}, \varphi_{i,2}, \dots, \varphi_{i,\|a\|})$,
$\varphi_{i,k} \in \{0, 1\}$, $k \in \{1, 2, \dots, \|a\|\}$, denote
the state of the used channels in the elapsed time slot.
Since each of the used channels may be in the good or the bad state,
the total number of possible states of the $\|a\|$ used channels
is $2^{\|a\|}$, and we use $\Psi = \{\varphi_i, i = 1, \dots, 2^{\|a\|}\}$ to denote the
set of all possible states of the used channels. For convenience
of notation, we represent the probability of state $\varphi_i$ as

$$f(\varphi_i) = \prod_{k=1}^{\|a\|} \theta(\varphi_{i,k}) \tag{9}$$

where

$$\theta(\varphi_{i,k}) = \begin{cases} p_{m_k} & \text{if } \varphi_{i,k} = 1 \\ 1 - p_{m_k} & \text{if } \varphi_{i,k} = 0 \end{cases} \tag{10}$$

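For each $\varphi_i$, the corresponding system belief in the next time
slot is $y_{\varphi_i} = (y_1^*, y_2^*, \dots, y_N^*)$, where

$$y_j^* = \begin{cases} \lambda_0 & \text{if } j = m_k \text{ and } \varphi_{i,k} = 0 \\ \lambda_1 & \text{if } j = m_k \text{ and } \varphi_{i,k} = 1 \\ T(p_j) & \text{otherwise} \end{cases} \tag{11}$$

From (9)-(11), we know that the belief in the next time slot will be
$y_{\varphi_i}$ with probability $f(\varphi_i)$. So the conditional value
function $V_a(p)$ is calculated as:

$$V_a(p) = \sum_{k=1}^{\|a\|} p_{m_k}\big(R_{\|a\|} + C_{\|a\|}\big) - \|a\| C_{\|a\|} + \beta \sum_{\varphi_i \in \Psi} f(\varphi_i) V(y_{\varphi_i}) \tag{12}$$

where $\Psi$ is the set of all possible states of the used channels.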



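More specifically, the sum in the last term of (12) can be written as

$$\begin{aligned}
\sum_{\varphi_i \in \Psi} f(\varphi_i) V(y_{\varphi_i}) ={}& (1 - p_{m_1})(1 - p_{m_2}) \cdots (1 - p_{m_M}) V(y_{(0,0,\dots,0)}) \\
&+ p_{m_1}(1 - p_{m_2}) \cdots (1 - p_{m_M}) V(y_{(1,0,\dots,0)}) \\
&+ (1 - p_{m_1}) p_{m_2} \cdots (1 - p_{m_M}) V(y_{(0,1,\dots,0)}) \\
&+ \cdots \\
&+ (1 - p_{m_1})(1 - p_{m_2}) \cdots p_{m_M} V(y_{(0,0,\dots,1)}) \\
&+ p_{m_1} p_{m_2} \cdots (1 - p_{m_M}) V(y_{(1,1,\dots,0)}) \\
&+ \cdots \\
&+ (1 - p_{m_1}) \cdots p_{m_{M-1}} p_{m_M} V(y_{(0,\dots,1,1)}) \\
&+ \cdots \\
&+ p_{m_1} p_{m_2} \cdots p_{m_M} V(y_{(1,1,\dots,1)})
\end{aligned} \tag{13}$$

where $M = \|a\|$. The Bellman equation (6) can then be
expressed as:

$$V(p) = \max_{a \in B} V_a(p) \tag{14}$$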
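
To make equations (12)-(14) concrete, here is a minimal sketch that evaluates the finite-horizon counterpart of the recursion exactly for three channels, by enumerating every feedback outcome of the used channels. It is not the authors' solution method. The horizon-limited value with a terminal value of zero is a simplification, and the constants are the illustrative parameter values quoted later for Fig. 3.

```python
from functools import lru_cache
from itertools import product

# Illustrative parameters; R[k] and C[k] are the per-channel reward and loss
# when k channels share the power.
LAM0, LAM1, BETA = 0.1, 0.9, 0.9
R = {1: 3.0, 2: 2.0, 3: 1.78}
C = {1: 1.5, 2: 1.0, 3: 0.89}
N = 3
ACTIONS = list(product((0, 1), repeat=N))


def T(p):
    """Belief update of an unused channel, equation (8)."""
    return (1.0 - p) * LAM0 + p * LAM1


def action_value(action, p, horizon):
    """Finite-horizon version of V_a(p) in equation (12)."""
    used = [j for j in range(N) if action[j] == 1]
    k = len(used)
    immediate = 0.0
    if k > 0:
        immediate = sum(p[j] for j in used) * (R[k] + C[k]) - k * C[k]
    if horizon == 1:
        return immediate
    future = 0.0
    # Enumerate the 2^k feedback outcomes of the used channels, as in (13).
    for phi in product((0, 1), repeat=k):
        prob = 1.0
        nxt = [T(pj) for pj in p]          # unused channels follow T
        for good, j in zip(phi, used):
            prob *= p[j] if good else 1.0 - p[j]
            nxt[j] = LAM1 if good else LAM0  # used channels are revealed
        if prob > 0.0:
            future += prob * value(tuple(nxt), horizon - 1)
    return immediate + BETA * future


@lru_cache(maxsize=None)
def value(p, horizon):
    """Finite-horizon version of the Bellman equation (14)."""
    return max(action_value(a, p, horizon) for a in ACTIONS)


print(value((0.9, 0.5, 0.1), 3))
```

Because the number of branches grows quickly with the horizon, this exact enumeration is only practical for short horizons; it is meant to show the structure of the recursion rather than to compute the infinite-horizon value.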

III. Structure of the Optimal Policy 

In this section, we will first study the structural features 
of the optimal policy, and then derive the optimal policy for 
power allocation over three identical channels. 

A. Properties of value function 

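Lemma 1: The value function $V_a(p)$, $a \in B$, is affine in $p_j$
and the following equality holds:

$$\begin{aligned}
V_a(p_1, \dots, p_{j-1}, c\,p + (1 - c)\,p', p_{j+1}, \dots, p_N) ={}& c\,V_a(p_1, \dots, p_{j-1}, p, p_{j+1}, \dots, p_N) \\
&+ (1 - c)\,V_a(p_1, \dots, p_{j-1}, p', p_{j+1}, \dots, p_N)
\end{aligned} \tag{15}$$

where $0 \le c \le 1$ is a constant and $j \in \{1, 2, \dots, N\}$. In this paper
we use the following definition of "affine": $h(x)$ is said to be
affine with respect to $x$ if $h(x) = ax + c$ with constants $a$ and
$c$.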



Proof: The equality in (15) naturally holds if $V_a(p)$, $a \in B$,
is affine in $p_j$ for all $j$. So we only need to prove the first
half of the lemma.

Suppose the system chooses action $a$ in a certain time
slot. Let $M = \|a\|$ be the number of used channels and
$m = \{m_1, m_2, \dots, m_M\}$, $m_j \in \{1, 2, \dots, N\}$ ($j = 1, \dots, M$), be the
set of channels chosen by action $a$. First we prove that Lemma
1 is true for the used channels in $m$. It is clear from equation (12)
that the first term on the right side of equation (12) is affine
in $p_{m_j}$ ($j = 1, \dots, M$), and from equation (13) it is clear that
the last term on the right side of equation (12) is also affine
in $p_{m_j}$ ($j = 1, \dots, M$). Therefore, for each used
channel $j$ ($j \in m$), the value function $V_a(p)$ is affine in $p_j$.

Next we need to prove that $V_a(p)$ is also affine in $p_j$ for
an unused channel $j$ ($j \notin \{m_1, m_2, \dots, m_M\}$). From equation
(12), we can see that the first and second terms on the
right side of the equation do not contain $p_j$
($j \notin \{m_1, m_2, \dots, m_M\}$), so we just need to consider the third
term $\beta \sum_{\varphi_i \in \Psi} f(\varphi_i) V(y_{\varphi_i})$. From equations (12) and (13),
we know that if $V(y_{\varphi_i})$ is affine in $p_j$, the lemma holds.

From (14), we know $V(y_{\varphi_i}) = V_{a'}(y_{\varphi_i})$, where $a'$ is
the optimal action that maximizes $V(y_{\varphi_i})$. If channel $j$ is
used in action $a'$, then according to (12) and the fact that
$T(p_j) = (1 - p_j)\lambda_0 + p_j\lambda_1 = \alpha p_j + \lambda_0$ is affine in $p_j$, we
can say $V(y_{\varphi_i}) = V_{a'}(y_{\varphi_i})$ is affine in $p_j$. If channel $j$ is
not chosen in action $a'$, we have $y_j^* = T(p_j)$, and $V(y_{\varphi_i})$
can be expressed as:

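$$\begin{aligned}
V(y_{\varphi_i}) &= V_{a'}(y_1^*, y_2^*, \dots, T(p_j), \dots, y_N^*) \\
&= \sum_{k=1}^{M'} y^*_{m'_k}\big(R_{M'} + C_{M'}\big) - M' C_{M'} + \beta \sum_{\varphi_i} f(\varphi_i) V(y^{**}_{\varphi_i}) \\
&= \sum_{k=1}^{M'} y^*_{m'_k}\big(R_{M'} + C_{M'}\big) - M' C_{M'} \\
&\quad + \beta\big[f(0, 0, \dots, 0) V(y_1^{**}, \dots, T^2(p_j), \dots, y_N^{**}) \\
&\quad + \cdots \\
&\quad + f(1, 1, \dots, 1) V(y_1^{**}, \dots, T^2(p_j), \dots, y_N^{**})\big]
\end{aligned} \tag{16}$$

where $M' = \|a'\|$, the subscript $m'_k$ denotes the index of a chosen channel in
action $a'$, $y^{**}$ denotes the corresponding system belief in
the next time slot, and $T^n(p)$ is defined as:

$$T^n(p) = T(T^{n-1}(p)) = \frac{\lambda_0}{1 - \alpha}(1 - \alpha^n) + \alpha^n p. \tag{17}$$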



From (16) it is clear that $V(y_{\varphi_i})$ becomes affine in $p_j$ as soon
as the system chooses channel $j$ and allocates power to it. If the
system never chooses channel $j$ as $n$ goes to infinity,
$V(y_{\varphi_i})$ becomes $V(c_1, c_2, \dots, c_N)$ ($c_1, \dots, c_N$ are
constants), since $T^n(p_j) \to \lambda_0/(1 - \alpha)$ when $n \to \infty$. In this situation,
$V(y_{\varphi_i})$ is also affine in $p_j$.

From all of the above, we have proved that $V_a(p)$, $a \in B$, is affine in $p_j$.

■
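
The closed form in equation (17) and the limit used at the end of the proof can be checked numerically. The sketch below compares repeated application of $T$ with the closed form; it assumes $\alpha = \lambda_1 - \lambda_0 < 1$ so that the division is well defined, and the function names are illustrative.

```python
def T(p, lam0, lam1):
    """One-step belief update of an unused channel, equation (8)."""
    return (1.0 - p) * lam0 + p * lam1


def T_n_iterated(p, n, lam0, lam1):
    """Apply T to p exactly n times."""
    for _ in range(n):
        p = T(p, lam0, lam1)
    return p


def T_n_closed_form(p, n, lam0, lam1):
    """Closed form of T^n(p), equation (17); requires alpha < 1."""
    alpha = lam1 - lam0
    return lam0 / (1.0 - alpha) * (1.0 - alpha ** n) + alpha ** n * p


def stationary_belief(lam0, lam1):
    """Limit of T^n(p) as n grows: lam0 / (1 - alpha)."""
    return lam0 / (1.0 - (lam1 - lam0))


lam0, lam1 = 0.1, 0.9
for n in (0, 1, 5, 50):
    print(n, T_n_iterated(0.9, n, lam0, lam1), T_n_closed_form(0.9, n, lam0, lam1))
print(stationary_belief(lam0, lam1))
```

With $\lambda_0 = 0.1$ and $\lambda_1 = 0.9$ the belief of a channel that is never used drifts toward 0.5 regardless of its starting value, which is why the value function becomes independent of $p_j$ in that limit.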

Lemma 2: The value function $V(p)$ is convex in $p_j$ and the
following inequality holds:

$$\begin{aligned}
V(p_1, \dots, p_{j-1}, c\,p + (1 - c)\,p', p_{j+1}, \dots, p_N) \le{}& c\,V(p_1, \dots, p_{j-1}, p, p_{j+1}, \dots, p_N) \\
&+ (1 - c)\,V(p_1, \dots, p_{j-1}, p', p_{j+1}, \dots, p_N)
\end{aligned} \tag{18}$$

Proof: The inequality holds when $V(p)$ is convex in $p_j$.
So we just need to prove the convexity of $V(p)$. Let $V^n(p)$
be the expected reward when the decision horizon spans only
$n$ time slots.

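When $n = 1$, from equations (7) and (12), we have:

$$V^1(p) = \max_{a \in B}\{V_a^1(p)\} = \max_{a \in B}\{g_a(p)\} = \max_{a \in B}\Big\{\sum_{i=1}^{\|a\|} p_{m_i}\big(R_{\|a\|} + C_{\|a\|}\big) - \|a\| C_{\|a\|}\Big\} \tag{19}$$

We can easily see that every element in the set
$\big\{\sum_{i=1}^{\|a\|} p_{m_i}(R_{\|a\|} + C_{\|a\|}) - \|a\| C_{\|a\|}\big\}$ is affine and
non-decreasing. Since the maximum of affine functions is convex,
$V^1(p)$ is convex in $p_j$.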

Next, we assume $V^k(p)$ is convex in $p_j$, $k \ge 1$, and we
now prove that $V^{k+1}(p)$ is also convex in $p_j$. We have:

$$V^{k+1}(p) = \max_{a \in B}\{V_a^{k+1}(p)\} \tag{20}$$

where

$$V_a^{k+1}(p) = \sum_{i=1}^{\|a\|} p_{m_i}\big(R_{\|a\|} + C_{\|a\|}\big) - \|a\| C_{\|a\|} + \beta \sum_{\varphi_i \in \Psi} f(\varphi_i) V^k(y_{\varphi_i}) \tag{21}$$

The first and second terms in equation (21) are both affine in
$p_j$, so they are convex in $p_j$. Next we consider the third term
in (21). From (17) and (13), we know that each element in
the third term $\beta \sum_{\varphi_i \in \Psi} f(\varphi_i) V^k(y_{\varphi_i})$ is either affine in $p_j$
(when $y_j^* = \lambda_0$ or $\lambda_1$) or convex in $p_j$ (when $y_j^* = T(p_j)$).
So the third term is also convex in $p_j$. Now we have proved that
$V^{k+1}(p)$ is also convex in $p_j$.

From all of the above, we can draw the conclusion that for all
$n \ge 1$, $V^n(p)$ is convex in $p_j$. Since $V(p)$ is the limit
of $V^n(p)$ when $n \to \infty$, $V(p)$ is convex in $p_j$. ■

Lemma 3: Suppose a belief vector $p' = (p'_1, p'_2, \dots, p'_N)$ is
obtained by arbitrarily swapping the positions of the elements
in belief vector $p = (p_1, p_2, \dots, p_N)$ ($0 \le p_j \le 1$). Then the
following equality holds: $V(p) = V(p')$.

Proof: First, we prove that for all $a \in B$, there exists
$a' \in B$ that satisfies $V_a(p) = V_{a'}(p')$.

For action $a$, let $M = \|a\|$ be the number of used channels,
$m_1, m_2, \dots, m_M$ be the channel indexes, and $p_{m_1}, p_{m_2}, \dots, p_{m_M}$
be the beliefs of the used channels. Since $p$ and $p'$ have
the same elements (in a different order), we can find channels
$m'_1, m'_2, \dots, m'_M$ that satisfy the condition
$p'_{m'_k} = p_{m_k}$ ($k \in \{1, 2, \dots, M\}$). That is, we can find an action $a'$ that satisfies
$V_a(p) = V_{a'}(p')$, where $m'_k$ indicates the index of a used
channel in action $a'$.

From the above, we can establish a bijection $f: a \to a'$ on $B$
that satisfies $V_a(p) = V_{a'}(p')$. Consequently, we have
$\max_a\{V_a(p)\} = \max_{a'}\{V_{a'}(p')\}$. Therefore, $V(p) = V(p')$. ■
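
Lemmas 2 and 3 can be spot-checked numerically on the one-slot value function $V^1$ of equation (19), which is cheap to evaluate. The sketch below is only a sanity check for the horizon-one case under the illustrative parameters quoted later for Fig. 3. It does not verify the infinite-horizon statements, and the helper names are assumptions introduced here.

```python
from itertools import permutations, product

R = {1: 3.0, 2: 2.0, 3: 1.78}
C = {1: 1.5, 2: 1.0, 3: 0.89}
ACTIONS = list(product((0, 1), repeat=3))


def one_step_value(p):
    """V^1(p) from equation (19): best expected immediate reward."""
    best = 0.0  # action (0, 0, 0) yields zero reward and zero loss
    for a in ACTIONS:
        k = sum(a)
        if k:
            g = sum(pj for pj, aj in zip(p, a) if aj) * (R[k] + C[k]) - k * C[k]
            best = max(best, g)
    return best


def is_permutation_invariant(p, tol=1e-12):
    """Lemma 3 for horizon one: reordering the beliefs keeps the value."""
    base = one_step_value(p)
    return all(abs(one_step_value(q) - base) <= tol for q in permutations(p))


def is_midpoint_convex(p, q, tol=1e-12):
    """Lemma 2 for horizon one, checked at the midpoint of p and q."""
    mid = tuple(0.5 * (x + y) for x, y in zip(p, q))
    return one_step_value(mid) <= 0.5 * (one_step_value(p) + one_step_value(q)) + tol


grid = [tuple(v) for v in product((0.0, 0.25, 0.5, 0.75, 1.0), repeat=3)]
print(all(is_permutation_invariant(p) for p in grid))
print(all(is_midpoint_convex(p, q) for p in grid for q in grid))
```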

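B. Properties of the decision regions of policy $\pi^*$

Define $\Phi_a$ as the decision region of action $a$. That is, action
$a$ is optimal when the belief is in $\Phi_a$:

$$\Phi_a = \{p \mid V(p) = V_a(p), a \in B\} \tag{22}$$

Definition 1: If, given
$(p_1, \dots, p_{j-1}, x_1, p_{j+1}, \dots, p_N)$,
$(p_1, \dots, p_{j-1}, x_2, p_{j+1}, \dots, p_N) \in \Phi_a$, $x_1 \le x_2$, $1 \le j \le N$,
we have $(p_1, \dots, p_{j-1}, x, p_{j+1}, \dots, p_N) \in \Phi_a$ for all $x \in [x_1, x_2]$,
then we say $\Phi_a$ is contiguous along the $p_j$ dimension.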

Theorem 1: $\Phi_a$ is contiguous along the $p_1, p_2, \dots, p_N$
dimensions ($a \in B$).

Proof: Here we prove that $\Phi_a$ is contiguous along the $p_1$
dimension; the rest can be proved in a similar manner.

Let $(x_1, p_2, \dots, p_N), (x_2, p_2, \dots, p_N) \in \Phi_a$ and $x_1 \le x_2$. Then we
have $V(x_1, p_2, \dots, p_N) = V_a(x_1, p_2, \dots, p_N)$ and
$V(x_2, p_2, \dots, p_N) = V_a(x_2, p_2, \dots, p_N)$. Any $x \in [x_1, x_2]$ can be expressed as
$c x_1 + (1 - c) x_2$, where $0 \le c \le 1$.

From lemma 1 and lemma 2, we have:

$$\begin{aligned}
V(x, p_2, \dots, p_N) &= V(c x_1 + (1 - c) x_2, p_2, \dots, p_N) \\
&\le c\,V(x_1, p_2, \dots, p_N) + (1 - c)\,V(x_2, p_2, \dots, p_N) \\
&= c\,V_a(x_1, p_2, \dots, p_N) + (1 - c)\,V_a(x_2, p_2, \dots, p_N) \\
&= V_a(c x_1 + (1 - c) x_2, p_2, \dots, p_N) \\
&= V_a(x, p_2, \dots, p_N) \\
&\le V(x, p_2, \dots, p_N)
\end{aligned} \tag{23}$$

From (23) we have $V_a(x, p_2, \dots, p_N) = V(x, p_2, \dots, p_N)$,
that is, $(x, p_2, \dots, p_N) \in \Phi_a$. Therefore $\Phi_a$ is contiguous along the $p_1$
dimension. ■

C. Structure of the optimal policy over 3-dimensional state space

In order to visually demonstrate the structure of the optimal
policy, we consider a system with 3 parallel channels in this
section. In this system, each belief is a three-dimensional
vector $p \in \Omega = ([0, 1], [0, 1], [0, 1])$. Each action is also a
three-dimensional vector $a = (a_1, a_2, a_3)$, $a_j \in \{0, 1\}$,
$j \in \{1, 2, 3\}$. It is clear that there are 8 different actions in
total, each with a corresponding decision region. The following
theorem summarises the features of each decision region.

Theorem 2: $\Phi_{(0,0,0)}$ and $\Phi_{(1,1,1)}$ are self-symmetric with
respect to the planes $p_1 = p_2$, $p_1 = p_3$ and $p_2 = p_3$; $\Phi_{(0,0,1)}$ and
$\Phi_{(1,1,0)}$ are self-symmetric with respect to the plane $p_1 = p_2$;
$\Phi_{(0,1,0)}$ and $\Phi_{(1,0,1)}$ are self-symmetric with respect to the plane
$p_1 = p_3$; $\Phi_{(0,1,1)}$ and $\Phi_{(1,0,0)}$ are self-symmetric with respect
to the plane $p_2 = p_3$. $\Phi_{(1,0,1)}$ and $\Phi_{(0,1,1)}$, and $\Phi_{(1,0,0)}$ and $\Phi_{(0,1,0)}$,
are mirror-symmetric with respect to the plane $p_1 = p_2$; $\Phi_{(0,0,1)}$
and $\Phi_{(1,0,0)}$, and $\Phi_{(1,1,0)}$ and $\Phi_{(0,1,1)}$, are mirror-symmetric with
respect to the plane $p_1 = p_3$; $\Phi_{(0,0,1)}$ and $\Phi_{(0,1,0)}$, and $\Phi_{(1,0,1)}$ and
$\Phi_{(1,1,0)}$, are mirror-symmetric with respect to the plane $p_2 = p_3$.
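Proof: Let $(p_1, p_2, p_3) \in \Phi_{(0,0,0)}$; then we have
$V(p_1, p_2, p_3) = V_{(0,0,0)}(p_1, p_2, p_3)$. From (12) and lemma 3,
we have:

$$\begin{aligned}
V_{(0,0,0)}(p_1, p_2, p_3) &= \beta V(T(p_1), T(p_2), T(p_3)) \\
&= \beta V(T(p_1), T(p_3), T(p_2)) \\
&= \beta V(T(p_2), T(p_1), T(p_3)) \\
&= \beta V(T(p_2), T(p_3), T(p_1)) \\
&= \beta V(T(p_3), T(p_1), T(p_2)) \\
&= \beta V(T(p_3), T(p_2), T(p_1))
\end{aligned} \tag{24}$$

That is,

$$\begin{aligned}
V_{(0,0,0)}(p_1, p_2, p_3) &= V_{(0,0,0)}(p_1, p_3, p_2) \\
&= V_{(0,0,0)}(p_2, p_1, p_3) \\
&= V_{(0,0,0)}(p_2, p_3, p_1) \\
&= V_{(0,0,0)}(p_3, p_1, p_2) \\
&= V_{(0,0,0)}(p_3, p_2, p_1)
\end{aligned} \tag{25}$$

So $\Phi_{(0,0,0)}$ is self-symmetric with respect to the planes $p_1 = p_2$,
$p_1 = p_3$ and $p_2 = p_3$. Similarly we can prove that $\Phi_{(1,1,1)}$ is
self-symmetric with respect to the planes $p_1 = p_2$, $p_1 = p_3$ and
$p_2 = p_3$.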

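Next we prove that $\Phi_{(1,0,0)}$ and $\Phi_{(0,1,0)}$ are mirror-symmetric
with respect to the plane $p_1 = p_2$. Let $(p_1, p_2, p_3) \in \Phi_{(1,0,0)}$; then
$V(p_1, p_2, p_3) = V_{(1,0,0)}(p_1, p_2, p_3)$. From lemma 3, we have:

$$\begin{aligned}
V_{(0,1,0)}(p_2, p_1, p_3) &= p_1(R_1 + C_1) - C_1 \\
&\quad + \beta\big[p_1 V(T(p_2), \lambda_1, T(p_3)) + (1 - p_1) V(T(p_2), \lambda_0, T(p_3))\big] \\
&= p_1(R_1 + C_1) - C_1 \\
&\quad + \beta\big[p_1 V(\lambda_1, T(p_2), T(p_3)) + (1 - p_1) V(\lambda_0, T(p_2), T(p_3))\big] \\
&= V_{(1,0,0)}(p_1, p_2, p_3) \\
&= V(p_1, p_2, p_3) \\
&= V(p_2, p_1, p_3)
\end{aligned} \tag{26}$$

So $(p_2, p_1, p_3) \in \Phi_{(0,1,0)}$, that is, $\Phi_{(1,0,0)}$ and $\Phi_{(0,1,0)}$ are
mirror-symmetric with respect to the plane $p_1 = p_2$. The rest of
the theorem can be proved in a similar way. ■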
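After obtaining the basic features of the decision regions,
we now discuss the distribution of the decision regions in
the 3-dimensional belief space. First we consider the 8 vertices
of the cubic belief space: (0,0,0), (1,0,0), (0,1,0), (0,0,1),
(1,1,0), (1,0,1), (0,1,1) and (1,1,1). From equation (12),
it is straightforward to obtain the following result:

$$\begin{cases}
V(0,0,0) = V_{(0,0,0)}(0,0,0) \\
V(1,0,0) = V_{(1,0,0)}(1,0,0) \\
V(0,1,0) = V_{(0,1,0)}(0,1,0) \\
V(0,0,1) = V_{(0,0,1)}(0,0,1) \\
V(1,1,0) = V_{(1,1,0)}(1,1,0) \\
V(1,0,1) = V_{(1,0,1)}(1,0,1) \\
V(0,1,1) = V_{(0,1,1)}(0,1,1) \\
V(1,1,1) = V_{(1,1,1)}(1,1,1)
\end{cases}
\Rightarrow
\begin{cases}
(0,0,0) \in \Phi_{(0,0,0)} \\
(1,0,0) \in \Phi_{(1,0,0)} \\
(0,1,0) \in \Phi_{(0,1,0)} \\
(0,0,1) \in \Phi_{(0,0,1)} \\
(1,1,0) \in \Phi_{(1,1,0)} \\
(1,0,1) \in \Phi_{(1,0,1)} \\
(0,1,1) \in \Phi_{(0,1,1)} \\
(1,1,1) \in \Phi_{(1,1,1)}
\end{cases} \tag{27}$$

Next, we consider the 12 edges of the belief space cube.
We take the plane $p_3 = 0$ as an example to discuss the four
edges on it. When $p_3 = 0$, we have:

$$\begin{aligned}
V_{(0,0,0)} &= \beta V(T(p_1), T(p_2), \lambda_0) \\
V_{(0,0,1)} &= -C_1 + \beta V(T(p_1), T(p_2), \lambda_0) \\
V_{(0,1,0)} &= p_2(R_1 + C_1) - C_1 + \beta\big[p_2 V(T(p_1), \lambda_1, \lambda_0) + (1 - p_2) V(T(p_1), \lambda_0, \lambda_0)\big] \\
V_{(1,0,0)} &= p_1(R_1 + C_1) - C_1 + \beta\big[p_1 V(\lambda_1, T(p_2), \lambda_0) + (1 - p_1) V(\lambda_0, T(p_2), \lambda_0)\big] \\
V_{(0,1,1)} &= p_2(R_2 + C_2) - 2C_2 + \beta\big[p_2 V(T(p_1), \lambda_1, \lambda_0) + (1 - p_2) V(T(p_1), \lambda_0, \lambda_0)\big] \\
V_{(1,0,1)} &= p_1(R_2 + C_2) - 2C_2 + \beta\big[p_1 V(\lambda_1, T(p_2), \lambda_0) + (1 - p_1) V(\lambda_0, T(p_2), \lambda_0)\big] \\
V_{(1,1,0)} &= (p_1 + p_2)(R_2 + C_2) - 2C_2 + \beta\big[p_1 p_2 V(\lambda_1, \lambda_1, \lambda_0) + (1 - p_1) p_2 V(\lambda_0, \lambda_1, \lambda_0) \\
&\quad + p_1 (1 - p_2) V(\lambda_1, \lambda_0, \lambda_0) + (1 - p_1)(1 - p_2) V(\lambda_0, \lambda_0, \lambda_0)\big] \\
V_{(1,1,1)} &= (p_1 + p_2)(R_3 + C_3) - 3C_3 + \beta\big[p_1 p_2 V(\lambda_1, \lambda_1, \lambda_0) + (1 - p_1) p_2 V(\lambda_0, \lambda_1, \lambda_0) \\
&\quad + p_1 (1 - p_2) V(\lambda_1, \lambda_0, \lambda_0) + (1 - p_1)(1 - p_2) V(\lambda_0, \lambda_0, \lambda_0)\big]
\end{aligned} \tag{28}$$

In Section II, we assume that $R_b < R_a < bR_b/a$, $C_b < C_a < bC_b/a$
and $R_a > C_a$ ($1 \le a < b \le N$), so we can
learn from (28) that $V_{(0,0,0)} > V_{(0,0,1)}$, $V_{(0,1,0)} > V_{(0,1,1)}$,
$V_{(1,0,0)} > V_{(1,0,1)}$ and $V_{(1,1,0)} > V_{(1,1,1)}$. Therefore, the optimal
actions on this plane are restricted to the following four
actions: (0,0,0), (0,1,0), (1,0,0), (1,1,0).

On the edge $\{p_1 = 0, p_3 = 0\}$, according to lemma 2 and
the assumption in Section II, we have $V_{(0,1,0)} > V_{(1,1,0)}$ and
$V_{(0,1,0)} > V_{(1,0,0)}$. With this we know the optimal action on
this edge is either (0,1,0) or (0,0,0). From (28) we have:

$$\begin{aligned}
V_{(0,1,0)} - V_{(0,0,0)} ={}& p_2(R_1 + C_1) - C_1 + \beta\big[p_2 V(\lambda_0, \lambda_1, \lambda_0) \\
&+ (1 - p_2) V(\lambda_0, \lambda_0, \lambda_0) - V(\lambda_0, T(p_2), \lambda_0)\big]
\end{aligned} \tag{29}$$

Due to the convexity of $V(p)$, there exists

$$th_1 = \frac{C_1 + \beta\big[V(\lambda_0, T(p_2), \lambda_0) - V(\lambda_0, \lambda_0, \lambda_0)\big]}{R_1 + C_1 + \beta\big[V(\lambda_0, \lambda_1, \lambda_0) - V(\lambda_0, \lambda_0, \lambda_0)\big]} \tag{30}$$

so that when $p_2 > th_1$, $(0, p_2, 0) \in \Phi_{(0,1,0)}$; when $p_2 < th_1$,
$(0, p_2, 0) \in \Phi_{(0,0,0)}$ (Fig. 1e).

For the edge $\{p_1 = 1, p_3 = 0\}$ (Fig. 1e), in the same manner
we have $V_{(1,1,0)} > V_{(0,1,0)}$ and $V_{(1,1,0)} > V_{(0,0,0)}$. Thus, the
optimal action on this edge is either (1,1,0) or (1,0,0). From
(28) we have:

$$\begin{aligned}
V_{(1,1,0)} - V_{(1,0,0)} ={}& (1 + p_2)(R_2 + C_2) - 2C_2 - R_1 \\
&+ \beta\big[p_2 V(\lambda_1, \lambda_1, \lambda_0) + (1 - p_2) V(\lambda_1, \lambda_0, \lambda_0) \\
&- V(\lambda_1, T(p_2), \lambda_0)\big]
\end{aligned} \tag{31}$$

Due to the convexity of $V(p)$, there exists

$$th_2 = \frac{R_1 - R_2 + C_2 + \beta\big[V(\lambda_1, T(p_2), \lambda_0) - V(\lambda_1, \lambda_0, \lambda_0)\big]}{R_2 + C_2 + \beta\big[V(\lambda_1, \lambda_1, \lambda_0) - V(\lambda_1, \lambda_0, \lambda_0)\big]} \tag{32}$$

so that when $p_2 > th_2$, $(1, p_2, 0) \in \Phi_{(1,1,0)}$; when $p_2 < th_2$,
$(1, p_2, 0) \in \Phi_{(1,0,0)}$.

Using the symmetric properties in Theorem 2, we can easily
derive similar results on the other planes and edges. So the
structure of the optimal policy on the 6 planes of the cubic
belief space is as shown in Fig. 1, where

$$\begin{cases}
th_1 = \dfrac{C_1 + \beta\big[V(T(th_1), \lambda_0, \lambda_0) - V(\lambda_0, \lambda_0, \lambda_0)\big]}{R_1 + C_1 + \beta\big[V(\lambda_0, \lambda_1, \lambda_0) - V(\lambda_0, \lambda_0, \lambda_0)\big]} \\[2ex]
th_2 = \dfrac{R_1 - R_2 + C_2 + \beta\big[V(T(th_2), \lambda_1, \lambda_0) - V(\lambda_1, \lambda_0, \lambda_0)\big]}{R_2 + C_2 + \beta\big[V(\lambda_1, \lambda_1, \lambda_0) - V(\lambda_1, \lambda_0, \lambda_0)\big]} \\[2ex]
th_3 = \dfrac{2R_2 - 2R_3 + C_3 + \beta\big[V(T(th_3), \lambda_1, \lambda_1) - V(\lambda_1, \lambda_1, \lambda_0)\big]}{R_3 + C_3 + \beta\big[V(\lambda_1, \lambda_1, \lambda_1) - V(\lambda_1, \lambda_1, \lambda_0)\big]}
\end{cases} \tag{33}$$

Each threshold is defined implicitly, since it also appears on the
right-hand side through $T(th_i)$. The factor $\beta$ in the denominators
follows from setting the differences (29) and (31) to zero.

After the threshold on each edge is found, we next derive
the structure of the optimal policy in the whole cube.

Theorem 3: $\Phi_a$ is a simple connected region extended from
the vertex $\mathcal{V}_a$ of the cubic belief space $([0, 1], [0, 1], [0, 1])$,
where

$$\mathcal{V}_a = \begin{cases}
(0,0,0) & a = (0,0,0) \\
(1,0,0) & a = (1,0,0) \\
(0,1,0) & a = (0,1,0) \\
(0,0,1) & a = (0,0,1) \\
(1,1,0) & a = (1,1,0) \\
(1,0,1) & a = (1,0,1) \\
(0,1,1) & a = (0,1,1) \\
(1,1,1) & a = (1,1,1)
\end{cases} \tag{34}$$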



Proof: From (27) we already have $\mathcal{V}_a \in \Phi_a$, and from
Theorem 1 we know $\Phi_a$ has at least one connected region
extended from $\mathcal{V}_a$. Thus here we only need to prove that $\Phi_a$
has only one connected region.

Take $\Phi_{(0,0,0)}$ as an example. Let $\Phi'_{(0,0,0)}$ be a connected region
extended from (0,0,0). Because of the symmetry of the
region, there is a minimum cube $([0, Th_1], [0, Th_1], [0, Th_1])$
that includes $\Phi'_{(0,0,0)}$, as shown in Fig. 2a, and the state space
is split into several cubes. Due to the minimality of the cube
$([0, Th_1], [0, Th_1], [0, Th_1])$, we have $Th_1 \ge th_1$.

Consider the cube $([0, Th_1], [0, Th_1], [0, 1])$, and suppose there
exists another region $\Phi''_{(0,0,0)}$ in it. Then for all $(x, y, z) \in \Phi''_{(0,0,0)}$, the line
$p_1 = p_2 = Th_1$ will pass across both $\Phi'_{(0,0,0)}$ and $\Phi''_{(0,0,0)}$, which
makes $\Phi'_{(0,0,0)}$ and $\Phi''_{(0,0,0)}$ connected. Therefore, no such region $\Phi''_{(0,0,0)}$
exists in the cube $([0, Th_1], [0, Th_1], [0, 1])$. Similarly, we can
prove that there exists no $\Phi''_{(0,0,0)}$ in the cube $([0, Th_1], [0, 1], [0, Th_1])$ or
$([0, 1], [0, Th_1], [0, Th_1])$.

Fig. 1: Structure of the optimal policy on the boundary. Recoverable panels: (e) $p_3 = 0$, (f) $p_3 = 1$.

Next we consider the cube $([Th_1, 1], [0, 1], [0, 1])$. For all
$v = (x, y, z) \in ([Th_1, 1], [0, 1], [0, 1])$, since $Th_1 \ge th_1$, we have
$V_{(1,0,0)}(x, 0, 0) \ge V_{(0,0,0)}(x, 0, 0)$. From equation (12) and
Lemma 1, we have:

$$\begin{aligned}
\frac{\partial V_{(0,0,0)}(x, 0, p_3)}{\partial p_3} &= \beta \frac{\partial V(T(x), \lambda_0, T(p_3))}{\partial p_3} \\
\frac{\partial V_{(1,0,0)}(x, 0, p_3)}{\partial p_3} &= \beta \frac{\partial\big[x V(\lambda_1, \lambda_0, T(p_3)) + (1 - x) V(\lambda_0, \lambda_0, T(p_3))\big]}{\partial p_3}
\end{aligned} \tag{35}$$

From (35) we have
$\frac{\partial V_{(0,0,0)}(x, 0, p_3)}{\partial p_3} \le \frac{\partial V_{(1,0,0)}(x, 0, p_3)}{\partial p_3}$,
so we can tell from Fig. 2b that for all $0 \le z \le 1$,
$V_{(1,0,0)}(x, 0, z) \ge V_{(0,0,0)}(x, 0, z)$. Likewise, we have:

$$\begin{aligned}
\frac{\partial V_{(0,0,0)}(x, p_2, z)}{\partial p_2} &= \beta \frac{\partial V(T(x), T(p_2), T(z))}{\partial p_2} \\
\frac{\partial V_{(1,0,0)}(x, p_2, z)}{\partial p_2} &= \beta \frac{\partial\big[x V(\lambda_1, T(p_2), T(z)) + (1 - x) V(\lambda_0, T(p_2), T(z))\big]}{\partial p_2}
\end{aligned} \tag{36}$$

From (36) we have
$\frac{\partial V_{(1,0,0)}(x, p_2, z)}{\partial p_2} \ge \frac{\partial V_{(0,0,0)}(x, p_2, z)}{\partial p_2}$,
and from Fig. 2c we can tell that for all $0 \le y \le 1$,
$V_{(1,0,0)}(x, y, z) \ge V_{(0,0,0)}(x, y, z)$. Therefore, for all
$v = (x, y, z) \in ([Th_1, 1], [0, 1], [0, 1])$, we have $v \notin \Phi_{(0,0,0)}$,
that is, there exists no connected region $\Phi''_{(0,0,0)}$ in the cube
$([Th_1, 1], [0, 1], [0, 1])$. In the same manner, we can also
prove that there exists no connected region $\Phi''_{(0,0,0)}$ in the cubes
$([0, 1], [Th_1, 1], [0, 1])$ and $([0, 1], [0, 1], [Th_1, 1])$. Now we
have proved that there exists no other connected region $\Phi''_{(0,0,0)}$
in the whole belief space cube.

Fig. 2: Belief Space Segmentation. Panels: (a) Belief space region segmentation, (b) $V_a(x, 0, z)$, (c) $V_a(x, y, z)$.

The other 7 regions $\Phi_{(1,0,0)}$, $\Phi_{(0,1,0)}$, $\Phi_{(0,0,1)}$, $\Phi_{(1,1,0)}$,
$\Phi_{(1,0,1)}$, $\Phi_{(0,1,1)}$ and $\Phi_{(1,1,1)}$ can be proved in the same way. ■

IV. Simulation Based on Linear Programming

Linear programming is one of the approaches to solve the
Bellman equation. Based on [14], we model our problem as
the following linear programming formulation:

$$\min V(p), \quad \text{s.t. } g_a(p) + \beta \sum_{y \in X} f_a(p, y) V(y) \le V(p), \quad \forall p \in X, \forall a \in A_p \tag{37}$$

where $X$ denotes the belief space and $A_p$ is the set of available
actions for belief state $p$. The state transition probability
$f_a(p, y)$ is the probability that the next state will be $y$ when
the current state is $p$ and the current action is $a \in A_p$. The
optimal policy is given by

$$\pi(p) = \arg\max_{a \in A_p}\Big(g_a(p) + \beta \sum_{y \in X} f_a(p, y) V(y)\Big) \tag{38}$$

For ease of discussion and demonstration, we consider the
case of a three-dimensional belief space. We use the LOQO
solver on the NEOS Server [15] with AMPL input [16] to obtain
the solution of equation (37). Then we use MATLAB to
construct the policy according to equation (38).
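
The link between the linear program (37), the Bellman equation, and the policy rule (38) can be illustrated on a small discretized model without an LP solver. The sketch below is not the LOQO/AMPL/MATLAB workflow described above: it swaps in plain value iteration, and it uses a hypothetical finite belief grid in which each channel is tracked by its last observed state and a truncated number of slots since that observation (`MAX_AGE` is an assumption introduced here). It then checks that the resulting $V$ satisfies every constraint of (37) and that one constraint per state is tight, which is what an LP solution is expected to satisfy.

```python
from itertools import product

# Illustrative parameters (the values quoted for Fig. 3 in the text).
LAM0, LAM1, BETA = 0.1, 0.9, 0.9
R = {1: 3.0, 2: 2.0, 3: 1.78}
C = {1: 1.5, 2: 1.0, 3: 0.89}
N, MAX_AGE = 3, 2  # MAX_AGE: hypothetical truncation of the belief grid
ALPHA = LAM1 - LAM0


def belief(channel_state):
    """Belief T^age(lambda_obs) of a channel last observed `age` slots ago."""
    obs, age = channel_state
    start = LAM1 if obs else LAM0
    return LAM0 / (1.0 - ALPHA) * (1.0 - ALPHA ** age) + ALPHA ** age * start


CHANNEL_STATES = [(obs, age) for obs in (0, 1) for age in range(MAX_AGE + 1)]
STATES = list(product(CHANNEL_STATES, repeat=N))
ACTIONS = list(product((0, 1), repeat=N))


def reward(state, action):
    """Expected immediate reward g_a(p)."""
    k = sum(action)
    if k == 0:
        return 0.0
    total = sum(belief(state[j]) for j in range(N) if action[j])
    return total * (R[k] + C[k]) - k * C[k]


def transitions(state, action):
    """List of (probability, next_state) pairs, i.e. f_a(p, y)."""
    used = [j for j in range(N) if action[j]]
    result = []
    for phi in product((0, 1), repeat=len(used)):
        prob = 1.0
        # Unused channels age by one slot (truncated at MAX_AGE).
        nxt = [(obs, min(age + 1, MAX_AGE)) for obs, age in state]
        for good, j in zip(phi, used):
            pj = belief(state[j])
            prob *= pj if good else 1.0 - pj
            nxt[j] = (good, 0)  # feedback resets the channel
        result.append((prob, tuple(nxt)))
    return result


MODEL = {(s, a): (reward(s, a), transitions(s, a)) for s in STATES for a in ACTIONS}


def q_value(s, a, V):
    """Left-hand side of the constraint in (37)."""
    g, trans = MODEL[(s, a)]
    return g + BETA * sum(prob * V[y] for prob, y in trans)


def value_iteration(tol=1e-8):
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {s: max(q_value(s, a, V) for a in ACTIONS) for s in STATES}
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V
        V = new_V


V = value_iteration()
# Policy rule (38): pick the action that maximizes the constraint's left side.
policy = {s: max(ACTIONS, key=lambda a: q_value(s, a, V)) for s in STATES}
# Slack of each LP constraint: V(p) - [g_a(p) + beta * sum f_a(p, y) V(y)].
lp_slack = {(s, a): V[s] - q_value(s, a, V) for s in STATES for a in ACTIONS}
min_slack = min(lp_slack.values())
tight = max(min(lp_slack[(s, a)] for a in ACTIONS) for s in STATES)
print(len(STATES), min_slack, tight)
```

A standard linear program for a discounted MDP minimizes a positively weighted sum of $V$ over the states subject to these constraints, and its optimum coincides with the Bellman fixed point, so a solver-based run on the same grid should reproduce the same `V` up to numerical tolerance. The truncation at `MAX_AGE` is a coarse approximation of the continuous belief space and affects the values near the truncated states.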

Fig. [3] shows the AMPL solution of the value function and 
the corresponding optimal policy. We use the following set of 






(a) *(o,o,o) 



(b) *(i,o,o) 




Fig. 3: Structure of optimal policy. 



(c) *(0,1,0) 




(d) *(o,o,i) 



parameters: Ai 0.9, Aq 0.1, /3 0.9, Ri = 3, i?2 = 2, 
i?3 = 1.78, Ci = 1.5, C2 = 1, C3 = 0.89. Fig. |4] shows 
each of the 8 individual decision regions. We can see clearly 
in the figure that the decision regions have the symmetry and 
contiguity properties we gave in Section III. 

To better understand the optimal policy, we next investigate 
how the parameters Aq , Ai , i?i , i?2 , ^3 , C'l , C2 , C3 affect the 
structure of the decision regions. 

First, we consider the effect of Aq and Ai. Let \^a\ denote 
the volume of ^ define the normalized volume |$a|/|X| 
as the volume of normalized against the volume of the 
total belief space X . Due to the symmetry property of the 
decision regions, we only study the decision regions for the 
following 4 actions (0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1). For 
ease of notation, in the following discussion we use Bq, Bi, 
B2 and ^3 to denote these four actions, respectively. 

We first fix the value of $\lambda_1$ and increase $\lambda_0$ 
from 0.1 to 0.8. Fig. 5a shows the normalized volume of the 
four decision regions with increasing $\lambda_0$. Initially, 
when $\lambda_0 = 0.1$, $\Phi_{B_3}$ has the biggest volume; it 
then decreases rapidly as $\lambda_0$ increases. The volume of 
$\Phi_{B_1}$ also changes significantly with increasing 
$\lambda_0$ but, in contrast to $\Phi_{B_3}$, it increases 
rapidly as $\lambda_0$ increases. When $\lambda_0 = 0.8$, 
$\Phi_{B_1}$ has the biggest volume. These trends have the 
following implications: when $\lambda_0$ is small, which means 
the channels tend to remain in the bad state, it is beneficial 
to allocate power to all the channels (choose action 
$B_3 = (1, 1, 1)$), whilst when $\lambda_0$ is large, which 
means a channel is very likely to change from the bad state to 
the good state, it is better to "gamble" on one channel (choose 
action $B_1 = (1, 0, 0)$). 
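
To make the role of $\lambda_0$ concrete, the following sketch 
computes how the belief of an unobserved channel drifts and 
where it settles. It assumes the usual Gilbert-Elliott 
convention that $\lambda_0$ is the probability of moving from 
the bad state to the good state and $\lambda_1$ is the 
probability of staying in the good state, which matches the 
reading of small and large $\lambda_0$ given above; the function 
names are illustrative. 

```python
def drift(p, lam0, lam1):
    """One-step belief update for a channel that is not observed."""
    return lam0 * (1 - p) + lam1 * p


def stationary_good(lam0, lam1):
    """Fixed point of drift(): long-run probability that the channel is good."""
    return lam0 / (1 - lam1 + lam0)


# Sweep lam0 with lam1 held at 0.9, the value used for the other figures.
for lam0 in (0.1, 0.4, 0.8):
    print(lam0, round(stationary_good(lam0, 0.9), 3))
```

The long-run probability of the good state rises with 
$\lambda_0$, so a larger $\lambda_0$ pulls the belief of an 
unobserved channel towards the good state more strongly. 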

Similar trends can be observed in Fig. 5b, which shows 
the volumes of the four decision regions versus $\lambda_1$. 
When $\lambda_1$ is small, $\Phi_{B_1}$ has the biggest volume, 
which means it is optimal to "bet" on one channel. When 
$\lambda_1$ is greater than 0.49, $\Phi_{B_3}$ overtakes 
$\Phi_{B_1}$, which means that when $\lambda_1$ is big enough 
it is better for the system to take a more conservative action 
by allocating power to all the channels instead of "gambling" 
on one channel. Interestingly, $|\Phi_{B_0}|$ and $|\Phi_{B_2}|$ 
change only slightly with varying $\lambda_0$ and $\lambda_1$. 
This implies that, in order to maximize the reward, the system 
should either allocate the transmission power to all the 
channels or gamble on one channel. Using part of the channels 
($B_2 = (1, 1, 0)$) or doing nothing ($B_0 = (0, 0, 0)$) is not 
a good way to maximize the long-term reward. 

Fig. 4, panels (e)-(h): (e) $\Phi_{(1,1,0)}$, (f) $\Phi_{(1,0,1)}$, 
(g) $\Phi_{(0,1,1)}$, (h) $\Phi_{(1,1,1)}$. 

Fig. 4: Individual Decision Regions 

Next we study the effect of the immediate reward $R_k$ and 
the immediate loss $C_k$ ($1 \le k \le N$) on the structure of 
the optimal policy. Intuitively, if the ratio $R_k/C_k$ is 
large, the total system reward will be large. Fig. 6 shows 
that when $R_k/C_k$ grows, the normalized volume of 
$\Phi_{B_3}$ decreases, as does that of one other region; of 
the remaining two regions, one grows with $R_k/C_k$ and the 
other decreases at first and then increases. For all four 
actions, the volumes of the decision regions each reach a 
constant level and remain unchanged once $R_k/C_k$ grows beyond 
a certain value. 

In fact, we notice in Fig. 6 that the value of $R_k/C_k$ has 
a limited effect on the decision regions in terms of the 
percentage of the whole belief space that each decision region 
occupies. We now consider the ratios 
$\frac{k_2 R_{k_2}}{k_1 R_{k_1}}$ and 
$\frac{k_2 C_{k_2}}{k_1 C_{k_1}}$ and examine how they affect 
the structure of the optimal policy (here we fix the value of 
$R_k/C_k$, so $\frac{k_2 C_{k_2}}{k_1 C_{k_1}}$ changes along 
with $\frac{k_2 R_{k_2}}{k_1 R_{k_1}}$ in the same manner). As 
in Section III, we assume that 
$R_{k_2} < R_{k_1} < k_2 R_{k_2}/k_1$, 
$C_{k_2} < C_{k_1} < k_2 C_{k_2}/k_1$ and $R_k > C_k$ 
($1 \le k_1 < k_2 \le N$), so that when more channels are 
chosen in an action, the system obtains a larger immediate 
reward $k R_k$; our power allocation scheme therefore 
encourages the system to allocate power to more channels. 
Fig. 7 shows that when $\frac{k_2 R_{k_2}}{k_1 R_{k_1}}$ grows, 
the normalized $|\Phi_{B_1}|$ decreases whilst $|\Phi_{B_3}|$ 
increases. Therefore, when $\frac{k_2 R_{k_2}}{k_1 R_{k_1}}$ is 
large, the total immediate reward is large enough for the 
system to act conservatively by allocating the transmission 
power to all the channels, whilst when it is small, the total 
immediate reward is so small that the system would rather 
"gamble" on one channel. As observed in Fig. 6, the values of 
$|\Phi_{B_0}|$ and $|\Phi_{B_2}|$ change only slightly with 
varying $\frac{k_2 R_{k_2}}{k_1 R_{k_1}}$. 

Fig. 5: Normalized $|\Phi_a|$ with varying $\lambda_0$, 
$\lambda_1$ ($R_1 = 3$, $R_2 = 1.75$, $R_3 = 1.361$, 
$R_k/C_k = 2$). Panels: (a) $\lambda_0$, (b) $\lambda_1$. 

Fig. 6: Normalized $|\Phi_a|$ vs. $R_k/C_k$ ($1 \le k \le N$) 
($R_1 = 3$, $R_2 = 1.55$, $R_3 = 1.06$, $\lambda_0 = 0.1$, 
$\lambda_1 = 0.9$). Panels: (a) $B_0$, (b) $B_1$, (c) $B_2$, 
(d) $B_3$. 

Fig. 7: Normalized $|\Phi_a|$ with changing 
$\frac{k_2 R_{k_2}}{k_1 R_{k_1}}$ ($\lambda_0 = 0.1$, 
$\lambda_1 = 0.9$, $R_1 = 3$, $C_1 = 1.5$). Panels: (a) $B_0$, 
(b) $B_1$, (c) $B_2$, (d) $B_3$. 
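
The ordering assumption on the rewards and losses 
($R_{k_2} < R_{k_1} < k_2 R_{k_2}/k_1$, the same for $C_k$, and 
$R_k > C_k$ for $1 \le k_1 < k_2 \le N$) is easy to check 
numerically for a candidate parameter set. The helper below is 
a minimal sketch with a hypothetical name; it takes $R$ and $C$ 
as dictionaries keyed by the number of channels used and is 
applied here to the values quoted for Fig. 3. 

```python
def satisfies_ordering(R, C):
    """Check R_k > C_k and X_{k2} < X_{k1} < k2 * X_{k2} / k1 for X in (R, C), k1 < k2."""
    ks = sorted(R)
    if any(R[k] <= C[k] for k in ks):
        return False
    for i, k1 in enumerate(ks):
        for k2 in ks[i + 1:]:
            for X in (R, C):
                if not (X[k2] < X[k1] < k2 * X[k2] / k1):
                    return False
    return True


# Parameter values quoted in the text for Fig. 3.
R = {1: 3.0, 2: 2.0, 3: 1.78}
C = {1: 1.5, 2: 1.0, 3: 0.89}
ok = satisfies_ordering(R, C)
```

The right-hand inequality is equivalent to 
$k_1 R_{k_1} < k_2 R_{k_2}$, which is the statement that using 
more channels yields a larger total immediate reward $k R_k$. 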



From the discussion above, we can conclude that when 
$\lambda_1 - \lambda_0$ and $\frac{k_2 R_{k_2}}{k_1 R_{k_1}}$ 
are large, the system tends to act conservatively and share 
power among all the channels; when they are small, the system 
tends to "gamble" on one channel. No matter how the parameters 
change, action $B_2$ is a mediocre choice that brings a medium 
reward, so it is not often taken. Action $B_0$ is seldom chosen 
by the system, since it brings no immediate reward; it is 
chosen only when the belief is so small that the system is 
almost sure to suffer a loss. 

V. Conclusion 

In this paper, we have studied the power allocation problem 
over $N$ ($N \ge 3$) Gilbert-Elliott channels. We have 
theoretically derived the threshold-based structure of the 
optimal policy for $N = 3$, and graphically illustrated the 
structure by formulating and solving a linear program for the 
problem. For $N > 3$, it is difficult to demonstrate the 
results graphically, but it is possible to derive the structure 
mathematically, and we will work on this issue in the future. 
We would also like to investigate the case of non-identical 
channels and to use a multi-armed bandit (MAB) formulation to 
find the thresholds for multiple-channel systems with $N > 3$. 